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ABSTRACT 

Gene expression profiling lias been extensively used 
in the past decades, resulting in an enormous amount 
of expression data available in public databases. 
These data sets are informative in elucidating tran- 
scriptional regulation of genes underlying various 
biological and clinical conditions. However, it is 
usually difficult to identify transcription factors (TFs) 
responsible for gene expression changes directly 
from their own expression, as TF activity is often 
regulated at the posttranscriptional level. In recent 
years, technical advances have made it possible to 
systematically determine the target genes of TFs by 
ChlP-seq experiments. To identify the regulatory 
programs underlying gene expression profiles, we 
constructed a database of phenotype-specific regu- 
latory programs (DPRP, http://syslab.nchu.edu.tw/ 
DPRP/) derived from the integrative analysis of TF 
binding data and gene expression data. DPRP 
provides three methods: the Fisher's Exact Test, the 
Kolmogorov-Smirnov test and the BASE algorithm to 
facilitate the application of gene expression data for 
generating new hypotheses on transcriptional regu- 
latory programs in biological and clinical studies. 



INTRODUCTION 

In the past decade, gene expression profiling by micro- 
array and more recently by RNA-seq experiments has 



been extensively used to study transcriptional regulation, 
resulting in a plethora of expression data available in 
public databases such as the Gene Expression Omnibus 
(1). These data sets are informative in elucidating tran- 
scriptional regulation under various biological and 
clinical conditions. For example, a comparison of gene 
expression between breast cancer and normal breast 
tissues identifies differentially expressed genes (DEGs) 
that are presumably critical for carcinogenesis. 
Such gene expression alterations in response to condi- 
tional changes are programmed by a set of transcription 
factors (TFs). Unfortunately, TF activity is often 
regulated by phosphorylation/dephosphorylation and 
other posttranscriptional mechanisms and can be 
modified by mutation. Thus, it is usually difficult to 
identify TFs responsible for gene expression changes 
solely based on the expression of TFs (2^). 

In principle, the activity of TFs can be reflected by the 
expression changes of their target genes: on TF activation, 
the expression of a TF's target genes are more likely to be 
upregulated in the case of a transcriptional activator, and 
downregulated in the case of a transcriptional repressor; 
the opposite would be expected if a TF is deactivated. For 
example, we cannot consistently detect the expression 
change of the malfunctional p53 with a point mutation 
that abolishes the tumor suppressor's transcriptional regu- 
latory activity in tumor samples. However, the p53 gene 
targets are more hkely to be differentially expressed in the 
tumor sample with respect to a normal sample. Based on 
this rationale, several methods have been proposed to 
infer the regulatory activity of TFs based on the expres- 
sion change of their target genes (5-7). These methods 
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have achieved substantial success in yeast because the 
target genes for most yeast TFs have been determined 
by ChlP-chip experiments (8). Regardless, the application 
of these methods in human is limited by incomplete know- 
ledge of TF-gene interactions. In fact, to systematically 
identify human TF-gene interactions, previous studies 
have attempted to predict TF targets based on the 
existing TF binding motifs in DNA regions upstream of 
genes, albeit with a high false-positive rate (5,9). 

In recent years, technical advances have made it 
possible to systematically determine the target genes of 
TFs by ChlP-chip and ChlP-seq experiments (10,11). In 
fact, a large number of ChlP-chip and ChlP-seq data have 
been generated by large-scale projects or individual 
laboratories. For instance, the ENCODE consortium 
(Encyclopedia of DNA Elements) has generated 424 
ChlP-seq profiles, including >120 human TFs with 
various cell lines (12). Additionally, enormous amounts 
of gene expression data have accumulated over the past 
decade from studies addressing biological and chnical 
questions. The increasing availabihty of ChlP-seq data 
sets provides us with an unprecedented opportunity to 
reanalyze these gene expression data to further understand 
and dissect the regulatory networks underlying these ex- 
pression profiles. In previous reports, the ChEA databases 
collected large-scale ChlP-seq data and provided the inte- 
grative analysis of both ChlP-seq and gene expression 
data (13); the ChlP-array web server can integrate ChlP- 
seq data and gene expression profiles to construct regula- 
tory networks (14). Both databases only support Fisher's 
Exact test but database of phenotype-specific regulatory 
program (DPRP) provided three algorithms, suggesting 
better functionality and flexibihty. 

In this study, we estabhshed a second-level database, 
named the Database of Phenotype-specific Regulatory 
Programs, to facilitate the search and application of 
gene expression data for generating new hypotheses on 
transcriptional regulatory programs under diverse biolo- 
gical and clinical contexts. In the database, we have col- 
lected 984 gene expression data sets, which include 29 744 
samples. Each data set has several phenotype-specific 
subsets, and each subset is a group of samples. To study 
DEGs between two subsets, we defined the subset pair as 
two subsets within a data set. In the database, we have 
collected 984 gene expression data sets, which include 
29 744 samples and 3754 subset pairs. It contains a wide 
range of phenotypes such as disease, drug treatment and 
tissue type. Meanwhile, we have defined a collection TF- 
gene regulatory relationships containing 424 TF binding 
profiles derived from the ENCODE ChlP-seq data. We 
applied three different methods, the Fisher's exact test, 
the Kolmogorov-Smirnov test (KS test) and the Binding 
Association with Sorted Expression (BASE) algorithm we 
previously developed (7), to integrate gene expression 
profiles and the TF-gene interaction data to infer regula- 
tory networks underlying each expression data set. DPRP 
provides a user-friendly interface for generating testable 
hypotheses on transcriptional regulation underlying a 
wide range of biological and chnical phenomena. DPRP 
is freely available at http://syslab.nchu.edu.tw/DPRP/. 



MATERIALS AND METHODS 

Database construction 

Gene expression data 

We collected 984 gene expression data sets, which include 
29 744 samples. These data sets were originally generated 
to explore differential gene expressions under various con- 
ditions or treatments, e.g. gene expression changes during 
development; differential gene expression between differ- 
ent subtypes of breast cancer. Thus, each data set has 
several phenotype-specific subsets and each subset has a 
group of samples. To identify DEGs for each data set, we 
selected the subsets with at least three samples, and then 
performed t-test between each pair of subsets without 
overlapping samples. We obtained the DEGs (significantly 
upregulated or downregulated genes) for 3754 subset pairs 
representing a wide range of biological contexts 
(Supplementary Table SI). 

Phenotype annotation of gene expression data 

To systematically annotate gene expression data and 
address synonymous issues, we used the Unified Medical 
Language System (UMLS) technology that provides a 
comprehensive catalog of medical concepts (15). The 
UMLS includes Metathesaurus, semantic network and 
lexical resources. To concentrate on human disease 
study, we limited the UMLS concepts to three disease- 
related semantic types: 'Pathologic Function', Tnjury or 
Poisoning' and 'Anatomical Abnormahty'. To obtain the 
UMLS concepts for each data set, we used the UMLS 
natural language processing tool, MetaMap program 
(15), to process the summary description and the 
Medical Subject Headings of the PubMed record of the 
data set. It resulted in 4162 data set-concept relations 
including 757 distinct UMLS concepts (Supplementary 
Table S2). The phenotype annotation facihtates users to 
search specific biological or clinical concepts in enormous 
gene expression data. 

ChlP-seq data 

We downloaded 424 ChlP-seq track files from the 
ENCODE project (16), which represent the binding 
profiles of >120 human TFs in different ceU lines. Based 
on ChlP-seq data, we applied a method called Target iden- 
tification from profiles (TIP) algorithm (17) to calculate the 
binding affinity of each TF with all human RefSeq genes 
(18), resulting in a matrix containing binding affinities for 
all TF-gene pairs. TIP is a probabihstic model for the iden- 
tification of TF target genes. Moreover, TIP calculated the 
P value and the Q value for each TF-gene pair, allowing us 
to define the target gene set for each TF profile (a TF under 
a specific cell line). 

Inference of phenotype-specific regulatory programs 

We applied three different methods to integrate gene 
expression data with TF binding data to infer the regula- 
tory programs underlying expression profiles. Given a 
subset pair (e.g. estrogen treated versus untreated MCF7 
cell fines), we inferred the regulatory programs responsible 
for the DEGs. We connected the significant TFs based on 
ChlP-seq data to construct a regulatory network, in which 
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the TF^TF interactions were identified by the TIP algo- 
rithm (P < 0.01) and indicated that one regulates the tran- 
scription of the other. A brief description of these methods 
is as follows: 

Fisher's exact test 

Given a subset pair, we select the upregulated and the 
downregulated DEGs with /'<0.01. In case that the 
number of DEGs with P<0.01 is <500, we instead 
select the top 500 significant genes to ensure enough 
genes are included for stable results in subsequent statis- 
tical analyses. To estimate the significance of differential 
TF activity, we performed Fisher's exact tests to examine 
the overlap between the up-/downregulated gene set and 
TF target genes. This method requires two cutoff values: 
one is used to define the up- and the downregulated genes, 
and the other is used to define the TF target genes. A more 
detailed description of applying Fisher's exact test for TF 
activity inference can be found in previous studies (5,19). 

KS test 

Given a subset pair, we calculated the t-scores for all genes 
by comparing their expression levels between the two 
subsets. For each TF we performed KS test to compare 
the distributions of the t-scores between target genes and 
nontarget genes. To define the target gene set of a TF, we 
set the cutoff value as /'<0.01. If the number of target 
genes with P<0.01 is <500, we select the top 500 signifi- 
cant target genes for the regulatory program analysis. 
For each TF, the KS test resulted in a /" value, indicating 
the significance of its activity change, and a D value, 
indicating the direction of its activity change. A positive 
D value indicates that target genes of a TF have signifi- 
cantly higher expression levels than nontarget genes, and a 
negative D-value indicates the reverse. A similar KS test- 
based method has been proposed by Tsai et al. (6) to 
identify ceU cycle-related TFs in yeast. 

BASE algorithm 

The cutoff values for defining TF target genes and DEGs 
are usually arbitrary and hard to determine in advance. 
Comparing with the Fisher's exact test and the KS-test, 
BASE is a nonparametric algorithm that requires no 
cutoff setting for TF target genes or DEGs (20). First, we 
calculated the t-scores for all genes by comparing their ex- 
pression levels between a pair of subsets, and sorted them in 
the decreasing order to obtain a ranked gene list. Each gene 
in the list is associated with a z,-, the t-score for this gene, 
and a bi, the binding affinity of a TF to this gene calculated 
by TIP algorithm (17). Then we calculated a cumulative 
distribution function by aggregating |/,- ■ bi\ and a reference 
function by aggregating |?/|. Finally, we calculated the 
maximum deviation between the functions and applied a 
permutation-based method to normalize the score and to 
estimate its significance. The normalized score is called 
regulatory activity score (RAS), which indicates the direc- 
tion of the activity change of a TF. For a transcriptional 
activator, a positive/negative RAS indicates enhanced/ 
reduced activity of the TF, while for a transcriptional re- 
pressor, the reverse is true. A more detailed description 
about BASE can be found in (20). 



WEB INTERFACES 

We integrated gene expression data sets, phenotype infor- 
mation and ChlP-seq data sets to construct the DPRP 
database with a user-friendly web interface (Figure I). 
Users can search a disease concept to discover all related 
gene expression data sets, choose the interested data set 
and then select a subset pair within the data set for TF 
regulatory program analysis. For example, a user can type 
in 'Breast Carcinoma' as a keyword to obtain a hst of data 
sets related to breast cancer. To facihtate user-friendly text 
search, we adopted the jQuery AutoComplete technique 
to guide the user for keyword selection (http://jqueryui. 
com/autocomplete/). When a specific data set is selected, 
the database will list a number of subset pairs (e.g. breast 
cancer versus normal) for investigating regulatory activity 
of TFs. 

Given a subset pair, DPRP will generate a list of TFs 
with significant activity changes. To visualize the TF regu- 
latory program, the web server draws a regulatory TF 
network consisting of aU significant TFs, in which the 
TF^TF interaction indicates that one regulates the tran- 
scription of the other, which is identified by the TIP algo- 
rithm (/'<0.01) from ChlP-seq data. The ChlP-seq data 
support TF-gene regulations in different cell fines. Users 
can select a specific cell line to display a cell line-specific 
network or use all cell fines to display an integrated 
network. Some cell lines only have a few ChlP-seq experi- 
ments, which is not sufficient for TF network construc- 
tion. Thus, the web interface only allows users to select cell 
fines with at least 12 ChlP-seq experiments. Moreover, 
users can upload their own gene expression data onto 
the database, and then DPRP will perform Fisher's 
Exact test, KS test and BASE algorithm in the data. 

EXAMPLE APPLICATIONS 

To demonstrate the biological importance of DPRP, 
we used GDS3283 and GDS3044 as examples to show 
the cell-specific regulatory TF networks (Figure 2). 
GDS3283 is a gene expression data set with estradiol treat- 
ment using MCF7 breast epithelial cancer cells (21). The 
BASE algorithm identifies 74 TFs with significantly differ- 
ential activity (Q< 0.001), in which the most significant 
TF is estrogen receptor alpha (ESRl) based on ChlP-seq 
experiments carried out in T47D cells (Figure 2A). 
Obviously, this result is consistent with our knowledge 
that the estradiol treatment significantly induces the 
activity of ESRl in breast cancer cells. Interestingly, 
when the expression levels of ESRl are compared 
between estradiol treatment and control samples, we 
cannot detect significant expression change of ESRl in 
GDS3283. Thus, the BASE algorithm identified the key 
regulator that cannot be discovered by differential expres- 
sion analysis. 

Since the GDS3283 is a breast cancer data set, we 
selected the ChlP-seq experiments from breast cancer 
cell fines (T47D and MCF7 cells) to generate the regula- 
tory TF network with Q < 0.001 (Figure 2B), which con- 
tained five significant TFs: ESRl, GATA3, FOXAl, 
MYC and E2F1. In this regulatory network, ESRl, 
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Figure 1. An overview of the DPRP web interface. (A) Users can perform a query by the following procedures: (i) Users can input a disease name in 
the auto-completed keyword field, which provides a list of partially matched UMLS concepts for selection. Alternatively, users can also input a data 
set ID in the keyword field to select a specific data set. (ii) After UMLS concept selection, the data sets associated with the selected concept will be 
shown in a data set list, from which the user can select the data set of interest, (iii) Given a specific data set, the subset pairs from the selected data 
set will be displayed in a subset list, and then the user can select the subset pair to search TF regulatory programs. DPRP provides three different 
methods to rank the potential TFs, in which users can determine which ranking guidelines to use. In addition, users can upload their own gene 
expression data with gene list and t-value of t-test or log ratios between two subsets. (B) The database integrated gene expression data and ChlP-seq 
TF binding data to identify the regulatory programs underlying a selected phenotype pair. (C) The output web pages: DPRP generates a list of the 
TFs and ranks them by their P values or Q values. In TF table view, users can export the table of candidate TFs as a text file. Based on the ranked 
TF list, DPRP generates a regulatory network consisting of all significant TFs, in which users can export the TF network as a png, svg or xml file. 



FOXAl and GATA3 formed a tight regulatory module. 
These results are consistent with the finding by Kong et al. 
(22) that FOXAl and GATA3 are essential co-regulators 
in estrogen response pathway and that ESRl, FOXAl 
and GATA3 formed an enhanceosome in breast cancer 
cells. Activation of MYC and E2F1 may indicate that 



estrogen treatment can promote cell prohferation of 
MCF7 cells. 

DPRP also provides insights into drug mechanisms. 
GDS3044 is a gene expression data set in K562 (the 
leukemia cell line) cells treated with imatinib. Figure 2C 
shows that the imatinib treatment significantly increases 
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Figure 2. Example applications. (A) The complete regulatory TF network associated with estradiol treatment in MCF7 cells from the GDS3283 data 
set. The network contains 74 significant TPs identified from ChlP-seq data in all cell hues, in which the most significant TF is ESRl. This is the 
regulatory network output by the BASE method with Q < 0.001, when users select the GDS3283 data set and subset pair 'estradiol treatment versus 
control'. (B) The regulatory TF network specific for T47D + MCF7 cell lines. In the network, only the significant TFs with ChlP-seq data from T47D 
and MCF7 are displayed. (C) The regulatory TF network associated with imatinib treatment in K562 cells from the GDS3044 data set. This is the 
output by the BASE method (Q < 0.001), when users select the GDS3044 data set with subset pair 'imatinib treatment versus control', and then select 
the K562-specific TF network. The network contains 43 significant TFs, in which the most significant TF is TALI. 



the activity of three TFs: GATAl, GATA2 and TALI 
in K562 cells. It is known that imatinib inhibits the 
kinase activity of BCR-ABL protein, which is the 
pathophysiologic cause of chronic myelogenous 
leukemia. Previous studies have shown that BCR-ABL 
suppresses the GATAl activity, and thus explains why 
we observed an increased activity of GATAl in response 
to imatinib treatment (16,23). In addition, TALI, the 
T-cell acute lymphocytic leukemia protein 1 , is specifically 
expressed in early erythroid ceUs and interacts with 
GATAl (24), which also supports our result. 



DISCUSSION 

In this study, we applied three different methods to infer 
the regulatory programs underlying given gene expression 
profiles. To apply the Fisher's exact test, DEGs have to be 
defined based on the gene expression data. At the same 
significance level, the numbers of DEGs vary substantially 
in different gene expression data sets, depending on the 
quahty of the data and the sample size. As a consequence, 
we expect variability in statistical power and robustness of 
Fisher's exact test. Similarly, the effectiveness of this 
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method is also influenced by the number of TF target 
genes. TFs with more target genes are more hkely to be 
identified as significant TFs. The KS-test does not require 
the up-/downregulated gene sets, but it stiU requires the 
predefined TF target genes. The BASE method requires 
neither a differential gene set nor a target gene set, and 
thus is more convenient in practice and does not have the 
bias issue. However, it estimates the significance of the TF 
RAS using the permutation of gene expression profiles 
(shuffle all genes in the profile), which often overestimates 
their significance. Because none of these methods is 
perfect, we provide the results from all three methods in 
the DPRP database. This aUows users to determine the 
stringency level and make decisions according to their 
own requirements, e.g. selecting the significant TFs 
identified by all methods to obtain a TF list of high 
confidence. 

Currently, we have included the ChlP-seq data 
generated by the ENCODE project in our database. 
There are many ChlP-seq and ChlP-chip data sets that 
have been generated by other large-scale projects or by in- 
dividual laboratories that will be included in the database. 
Moreover, we anticipate that an increasing number of TF 
ChlP-seq data wiU be generated in the near future. We will 
maintain our database with routine updates to ensure that 
we maintain a comprehensive hst of TFs. We beheve 
DPRP will be a useful database and resource for biolo- 
gical and clinical studies. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Onhne. 
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